[Metax][Optimization] Optimize PaddleOCR-VL vision path on Metax GPU #7619

Open
Dryoung95 wants to merge 5 commits into PaddlePaddle:develop from Dryoung95:codex/opt5-metax-paddleocr-vl

Dryoung95 (Contributor) commented Apr 24, 2026

Motivation

This PR optimizes the PaddleOCR-VL vision path on Metax GPU.

During profiling, extra overhead was observed around extract_vision_features_paddleocr(), especially in vision metadata preparation, position embedding preparation, and projector-side data organization. This PR reduces unnecessary host/device synchronization and repeated small tensor operations while keeping the existing vision computation semantics unchanged.

Modifications

This PR updates the following files:

  • fastdeploy/model_executor/models/paddleocr_vl/projector.py
  • fastdeploy/model_executor/models/paddleocr_vl/siglip.py
  • fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py
  • fastdeploy/worker/metax_model_runner.py
  • tests/model_executor/test_paddleocr_vl_vision_path.py

Main changes:

  1. Optimize PaddleOCR-VL projector-side packing flow and support returning packed image features directly.
  2. Reuse host-side grid_thw_lst metadata in extract_vision_features_paddleocr() to avoid unnecessary tensor-to-CPU synchronization on the FD_ENABLE_MAX_PREFILL=1 path.
  3. Optimize packed position embedding preparation in Siglip vision embeddings.
  4. Support the batch=1 fast path in Siglip attention and encoder layer while sharing the encoder-layer computation logic.
  5. Make rotary embedding precision explicit by requiring float32 input in apply_rotary_pos_emb_vision().
  6. Add unit tests for the changed PaddleOCR-VL vision-path logic.
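
Change 2 relies on the per-image grid sizes already being available on the host as plain Python integers, so derived metadata can be computed without reading a device tensor back. A minimal sketch of that idea, assuming each entry of grid_thw_lst is a (t, h, w) triple and that downstream code wants cumulative sequence lengths; the helper name build_cu_seqlens is hypothetical and is not taken from this PR:

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_cu_seqlens(grid_thw_lst: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Return cumulative patch counts [0, n0, n0+n1, ...] from host-side grids.

    Hypothetical helper: it works purely on Python ints, so no device-to-host
    copy (and no implicit synchronization) is needed to obtain the lengths.
    """
    # Each image (or video chunk) contributes t * h * w patches.
    seq_lens = [t * h * w for t, h, w in grid_thw_lst]
    # Prefix sums with a leading 0 mark the boundaries of each packed sequence.
    return [0, *accumulate(seq_lens)]


grid_thw_lst = [(1, 4, 6), (1, 2, 2), (2, 3, 3)]
cu_seqlens = build_cu_seqlens(grid_thw_lst)
print(cu_seqlens)  # [0, 24, 28, 46]
```

In a setup like this, the list would typically be converted to a device tensor once per batch instead of being recovered from a tensor on every call.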

Usage or Command

Unit test added for this PR:

  python -m pytest -q tests/model_executor/test_paddleocr_vl_vision_path.py

Local validation commands:

  python -m ruff check \
    fastdeploy/model_executor/models/paddleocr_vl/projector.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py \
    fastdeploy/worker/metax_model_runner.py \
    tests/model_executor/test_paddleocr_vl_vision_path.py

  python -m black --check \
    fastdeploy/model_executor/models/paddleocr_vl/siglip.py \
    fastdeploy/model_executor/models/paddleocr_vl/siglip_ops.py \
    fastdeploy/worker/metax_model_runner.py \
    tests/model_executor/test_paddleocr_vl_vision_path.py

Performance validation on Metax GPU:

  • Single request: 7.4039s -> 7.3870s, about -0.23%
  • 4x8 small concurrency average: about -6.07%
  • 4x8 small concurrency P50: about -5.97%
  • 4x8 small concurrency P95: about -9.73%
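
The percentages above are relative latency changes, so a negative value means the optimized build is faster. As a check on the single-request figure, the calculation can be sketched as follows (the function name relative_change is illustrative only):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent; negative means `after` is smaller (faster)."""
    return (after - before) / before * 100.0


single_request = relative_change(7.4039, 7.3870)
print(f"{single_request:.2f}%")  # -0.23%
```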

Accuracy Tests

This PR keeps the PaddleOCR-VL vision math semantics unchanged and only reduces unnecessary data organization and metadata movement overhead.

Validation performed:

  • Service starts successfully.
  • /health returns 200.
  • /v1/models returns normally.
  • Single OCR request returns normally.
  • 4x8 small concurrency requests complete stably.
  • Added unit tests covering projector packing, Siglip batch=1 fast path, position embedding cache/cast path, and native Neox rope embedding 2D/3D paths.

Local test result:

  python -m pytest -q tests/model_executor/test_paddleocr_vl_vision_path.py
  # 7 passed

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot (Bot) commented Apr 24, 2026

Thanks for your contribution!

CLAassistant commented Apr 24, 2026

CLA assistant check
All committers have signed the CLA.

Dryoung95 force-pushed the codex/opt5-metax-paddleocr-vl branch from c6b3374 to 162c6f5 on April 24, 2026 16:54

Dryoung95 (Contributor, Author) commented:

So far there is no code-level evidence of a build failure, a test failure, or a smoke-test failure; this looks more like a problem in the runner / Jenkins remoting layer.

Could someone please rerun this check?

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 93.67089% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@73f11e0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...y/model_executor/models/paddleocr_vl/siglip_ops.py | 80.00% | 2 Missing and 2 partials ⚠️ |
| ...eploy/model_executor/models/paddleocr_vl/siglip.py | 96.42% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7619   +/-   ##
==========================================
  Coverage           ?   71.87%           
==========================================
  Files              ?      396           
  Lines              ?    55493           
  Branches           ?     8689           
==========================================
  Hits               ?    39884           
  Misses             ?    12854           
  Partials           ?     2755           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 71.87% <93.67%> (?) |
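
The figures in the report are internally consistent, which can be checked with a few lines of arithmetic. The sketch below uses only the numbers printed above; the count of 79 changed lines is an inference from "93.67089% with 5 lines missing" and is not stated by Codecov in this report:

```python
# Numbers copied from the Codecov report above.
hits, misses, partials, lines = 39884, 12854, 2755, 55493

# Every tracked line is a hit, a miss, or a partial.
assert hits + misses + partials == lines

project_coverage = 100.0 * hits / lines
print(f"{project_coverage:.2f}%")  # 71.87%

# Assumption: 79 changed lines, inferred from the reported patch percentage.
patch_lines, patch_uncovered = 79, 5
patch_coverage = 100.0 * (patch_lines - patch_uncovered) / patch_lines
print(f"{patch_coverage:.5f}%")  # 93.67089%
```

The 5 uncovered patch lines also match the per-file rows: 2 missing plus 2 partials in siglip_ops.py and 1 partial in siglip.py.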

Dryoung95 (Contributor, Author) commented:

@luotao1 Could you please trigger the CI?

PaddlePaddle-bot commented Apr 28, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-29 13:33:06

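The CI report is generated from the following code (updated every 30 minutes):


1 Task overview

⏳ CI is still running. 4 Required tasks have not finished and no task has failed so far; please wait for completion to see the final result.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 28(0) | 28 | 21 | 0 | 5 | 2 | 0 |

2 Task status summary

2.1 Required tasks: 4/8 passed

Required tasks block merging. 4 are still running; please wait for them to finish.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | CI details | - |
| | Run Base Tests / base_tests | - | CI details | - |
| | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | CI details | - |
| | Run Four Cards Tests / run_4_cards_tests | - | CI details | - |
| | The other 4 required tasks passed | - | - | - |

2.2 Optional tasks: 17/20 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | Trigger Jenkins for PR | - | CI details | - |
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| | The other 17 optional tasks passed | - | - | - |

3 Failure details (required only)

No required tasks have failed.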

Dryoung95 changed the title from "[Metax][Optimization] 优化 PaddleOCR-VL 在 Metax GPU 上的视觉路径开销" (Chinese for "Optimize the vision-path overhead of PaddleOCR-VL on Metax GPU") to "[Metax][Optimization] Optimize PaddleOCR-VL vision path on Metax GPU" on Apr 28, 2026

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-04-29 13:28:10

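📋 Review summary

PR overview: Optimizes the PaddleOCR-VL vision path for Metax GPU by reducing host/device synchronization and repeated small-tensor operations, improving concurrent inference performance.
Change scope: model_executor/models/paddleocr_vl/ (projector, siglip, siglip_ops), worker/metax_model_runner.py, unit tests
Impact tags: [Models] [Metax] [Optimization]

📝 PR convention check

The title contains [Metax][Optimization], both of which are official tags, and its format is compliant. The description includes all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) with substantial content, and the Checklist state matches the diff. ✅ The PR follows the conventions.

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | siglip.py:131 | assert is used for the runtime batch=1 check and is disabled under Python -O |
| 🟡 Suggestion | metax_model_runner.py:507 | assert is used for the runtime grid_thw consistency check and is disabled under Python -O |
| ❓ Question | siglip_ops.py:40 | apply_rotary_pos_emb_vision now requires float32 input; please confirm that neox_rope_embedding has been updated accordingly |

Overall assessment

This PR systematically reduces vision-path overhead through batched projection, fused metadata construction, reuse of the LFU position-embedding cache, and mm_hash deduplication. The approach is clear and the unit tests cover the changes thoroughly. The two asserts used as runtime defensive checks should be replaced with raise ValueError so that they do not silently stop working in Python's optimized mode. For the float32 contract of apply_rotary_pos_emb_vision, the author is asked to clarify whether the neox_rope_embedding side has been updated accordingly.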

):
    B, seq_length, D = hidden_states.shape
    if hidden_states.dim() == 3:
        assert hidden_states.shape[0] == 1, f"SiglipAttention only supports batch=1, got {hidden_states.shape}"

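🟡 Suggestion: assert is used for runtime shape validation

Under Python -O, assert statements are skipped entirely, so this check silently stops working.

Consider raising an explicit exception instead: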

if hidden_states.shape[0] != 1:
    raise ValueError(
        f"SiglipAttention only supports batch=1, got {hidden_states.shape}"
    )
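
The difference can be observed by running the same check in child interpreters with and without -O. This is a minimal standard-library sketch; the two check functions are illustrative stand-ins, not code from this PR:

```python
import subprocess
import sys

# The same validation written once with assert and once with an explicit raise.
SNIPPET = """
def check_with_assert(batch):
    assert batch == 1, "only batch=1 is supported"
    return "assert check passed"

def check_with_raise(batch):
    if batch != 1:
        raise ValueError("only batch=1 is supported")
    return "raise check passed"

for check in (check_with_assert, check_with_raise):
    try:
        print(check(2))
    except (AssertionError, ValueError) as exc:
        print(type(exc).__name__)
"""


def run(flags):
    """Run SNIPPET in a child interpreter and return its printed lines."""
    result = subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


normal = run([])          # assert fires: ['AssertionError', 'ValueError']
optimized = run(["-O"])   # assert stripped: ['assert check passed', 'ValueError']
print(normal, optimized)
```

Only the explicit raise rejects the invalid batch size in both modes.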

def apply_rotary_pos_emb_vision(x, cos, sin):
    orig_dtype = x.dtype
    x = x.astype("float32")
    assert x.dtype == paddle.float32, f"expected float32, got {x.dtype}"

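❓ Question: apply_rotary_pos_emb_vision interface change: callers must now guarantee float32 input

native_neox_rope_embedding already performs the cast correctly before the call. However, SiglipAttention.forward actually calls neox_rope_embedding, whose signature is (qkv, cos, sin, num_heads, head_dim) (presumably a custom op), and that function does not appear in this PR's diff. If it calls apply_rotary_pos_emb_vision internally without guaranteeing float32 input, the assert will fail at runtime.

Please confirm that neox_rope_embedding also guarantees float32 input, or explain that it does not go through this function.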
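
For context, a Neox-style rotary embedding is commonly written as x * cos + rotate_half(x) * sin, which rotates each pair of elements taken from the two halves of the vector by a position-dependent angle. The pure-Python sketch below illustrates that formulation on a single vector. It is an assumption about the general technique, not the code of apply_rotary_pos_emb_vision or neox_rope_embedding, and it sidesteps the dtype question by using Python floats:

```python
import math
from typing import List


def rotate_half(x: List[float]) -> List[float]:
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(x: List[float], cos: List[float], sin: List[float]) -> List[float]:
    """Elementwise x * cos + rotate_half(x) * sin."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


# One frequency (head_dim = 2) and a quarter-turn angle.
theta = math.pi / 2
x = [1.0, 0.0]
out = apply_rotary(x, [math.cos(theta)] * 2, [math.sin(theta)] * 2)
print([round(v, 6) for v in out])  # [0.0, 1.0]
```

Because the result mixes products of the input with cos and sin values, running it in a low-precision dtype changes the output, which is why the precision of the input matters for this function.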

else:
    grid_thw_tensor = paddle.to_tensor(grid_thw_key, dtype=paddle.int64)
    multi_vision_inputs["images_lst"].append(
        paddle.to_tensor(

🟡 Suggestion: assert is used for a runtime data-consistency check

It is disabled under Python -O. Consider changing it to:

if pending_mm_grid_thw[mm_hash] != grid_thw_key:
    raise ValueError(
        f"mm_hash {mm_hash} grid_thw mismatch: "
        f"{pending_mm_grid_thw[mm_hash]} != {grid_thw_key}"
    )
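
The check guards a deduplication table keyed by mm_hash: an input whose hash was already seen must carry the same grid. A self-contained sketch of that pattern, assuming pending_mm_grid_thw is a plain dict from hash to a (t, h, w) tuple; the function register_mm_input is hypothetical and only reuses the names from the suggestion above:

```python
from typing import Dict, Tuple

GridTHW = Tuple[int, int, int]


def register_mm_input(
    pending_mm_grid_thw: Dict[str, GridTHW], mm_hash: str, grid_thw_key: GridTHW
) -> bool:
    """Record a multimodal input; return True if it is new, False if a duplicate.

    Raises ValueError (not AssertionError) when a known hash arrives with a
    different grid, so the check still runs under `python -O`.
    """
    if mm_hash not in pending_mm_grid_thw:
        pending_mm_grid_thw[mm_hash] = grid_thw_key
        return True
    if pending_mm_grid_thw[mm_hash] != grid_thw_key:
        raise ValueError(
            f"mm_hash {mm_hash} grid_thw mismatch: "
            f"{pending_mm_grid_thw[mm_hash]} != {grid_thw_key}"
        )
    return False


pending: Dict[str, GridTHW] = {}
print(register_mm_input(pending, "abc", (1, 4, 6)))  # True
print(register_mm_input(pending, "abc", (1, 4, 6)))  # False
```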

6 participants